Predicting Deciduous Broadleaf Forests using Climate Data
Prepared by Alejandro Ordonez
Published 7 October 2025
1 Introduction
In this 90-minute practical, you will gain hands-on experience with multivariate ordination and classification techniques commonly used in ecology and biogeography. These methods help us understand how environmental variables structure biological communities and predict vegetation distributions.
Learning Objectives:
Apply Principal Component Analysis (PCA) to reduce dimensionality of environmental data
Perform Principal Coordinates Analysis (PCoA) and Non-metric Multidimensional Scaling (NMDS)
Use K-means clustering to classify vegetation types
Interpret ordination plots and assess classification accuracy
Predict the distribution of deciduous broadleaf forests based on climate variables
Dataset Overview:
You have three datasets:
BiomeData.csv - Site locations (x, y coordinates) and biome classifications (500 sites per vegetation type)
ClimateData.csv - 19 bioclimatic variables plus elevation for each site
WorldClimateData.csv - Global climate data for prediction (not used in this practical)
The bioclimatic variables (bio.1 to bio.19) represent annual trends, seasonality, and extreme environmental factors derived from temperature and precipitation data.
WorldClim Bioclimatic Variables

| Variable | Name | Description | Units |
|---|---|---|---|
| bio.1 | Annual Mean Temperature | Mean of all weekly mean temperatures | °C |
| bio.2 | Mean Diurnal Range | Mean of monthly (max temp - min temp) | °C |
| bio.3 | Isothermality | (bio.2/bio.7) × 100; day-to-night temperature oscillations relative to annual oscillations | % |
| bio.4 | Temperature Seasonality | Standard deviation of weekly mean temperatures | °C × 100 |
| bio.5 | Max Temperature of Warmest Month | Highest temperature of any monthly maximum temperature | °C |
| bio.6 | Min Temperature of Coldest Month | Lowest temperature of any monthly minimum temperature | °C |
| bio.7 | Temperature Annual Range | bio.5 - bio.6 | °C |
| bio.8 | Mean Temperature of Wettest Quarter | Mean temperature during the wettest quarter | °C |
| bio.9 | Mean Temperature of Driest Quarter | Mean temperature during the driest quarter | °C |
| bio.10 | Mean Temperature of Warmest Quarter | Mean temperature during the warmest quarter | °C |
| bio.11 | Mean Temperature of Coldest Quarter | Mean temperature during the coldest quarter | °C |
| bio.12 | Annual Precipitation | Sum of monthly precipitation values | mm |
| bio.13 | Precipitation of Wettest Month | Precipitation of the wettest month | mm |
| bio.14 | Precipitation of Driest Month | Precipitation of the driest month | mm |
| bio.15 | Precipitation Seasonality | Coefficient of variation of monthly precipitation | % |
| bio.16 | Precipitation of Wettest Quarter | Total precipitation during the wettest quarter | mm |
| bio.17 | Precipitation of Driest Quarter | Total precipitation during the driest quarter | mm |
| bio.18 | Precipitation of Warmest Quarter | Total precipitation during the warmest quarter | mm |
| bio.19 | Precipitation of Coldest Quarter | Total precipitation during the coldest quarter | mm |
Note: Temperature variables are in degrees Celsius (°C), precipitation variables are in millimeters (mm). Quarter refers to a period of 3 consecutive months.
3 Task 1: Principal Component Analysis (PCA)
3.1 Background and Objectives
What is PCA?
Principal Component Analysis is an ordination technique that transforms correlated environmental variables into a smaller set of uncorrelated variables called principal components (PCs). Each PC is a linear combination of the original variables and captures a decreasing amount of variance in the dataset.
Why use PCA?
Reduces dimensionality while retaining most variation in the data
Identifies which environmental variables drive differences between sites
Visualizes multidimensional data in 2D or 3D space
Works best with linear relationships between variables
What we will do:
Load and prepare the climate and biome data
Standardize environmental variables (important because variables have different units)
Perform PCA on the environmental data
Determine how many PC axes to retain using the “elbow method” and “broken stick” model
Interpret variable loadings to understand what each axis represents
Visualize the ordination space colored by biome type
Add convex hulls to show the environmental space occupied by each biome
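The two PCA properties described above - components that are mutually uncorrelated and ordered by variance explained - can be seen in a tiny stand-alone sketch using made-up numbers (not the course files):

```r
# Toy data: two correlated variables plus an independent one
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.3), c = rnorm(100))

pca <- prcomp(toy, scale. = TRUE)

round(cor(pca$x), 3)            # off-diagonals are 0: PCs are uncorrelated
summary(pca)$importance[2, ]    # proportion of variance, decreasing by PC
```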
3.2 Step 1.1: Load and Prepare Data
What this step does: In this step, we import two CSV files containing biome classifications and climate data, merge them by their common ID column so that each site has both its biome label and all environmental variables, and extract just the environmental variables (the 19 bioclimatic variables plus elevation) into a separate data frame for analysis. We also check for any missing values that could cause problems in our analysis.
Why we do this: Before running any analysis, we need to combine our ecological classification data (biomes) with the environmental predictors (climate variables). By merging the datasets, we create a complete picture where each site has both its identity (what biome it belongs to) and its environmental conditions. Checking for missing values is essential because most statistical methods cannot handle missing data and would either fail or produce unreliable results.
# Load the data
biome_data <- read.csv("BiomeData.csv")
climate_data <- read.csv("ClimateData.csv")

# Merge datasets by ID
full_data <- merge(biome_data, climate_data, by = "ID")

# Extract environmental variables (bio.1 to Elev)
env_vars <- full_data[, c("bio.1", "bio.2", "bio.3", "bio.4", "bio.5",
                          "bio.6", "bio.7", "bio.8", "bio.9", "bio.10",
                          "bio.11", "bio.12", "bio.13", "bio.14", "bio.15",
                          "bio.16", "bio.17", "bio.18", "bio.19", "Elev")]

# Check for missing values
sum(is.na(env_vars))
[1] 40
# Remove rows with missing values
env_vars <- env_vars[complete.cases(env_vars), ]
3.3 Step 1.2: Perform PCA
What this step does: Here we perform the actual Principal Component Analysis using R’s prcomp() function with scale. = TRUE, which standardizes each variable to have mean = 0 and standard deviation = 1. We then calculate how much variance each principal component explains and display this information in a summary table showing both individual and cumulative variance explained.
Why we do this: Scaling is critical because our bioclimatic variables have very different units and scales (e.g., temperature in °C ranging from -50 to 50 vs precipitation in mm ranging from 0 to 10,000). Without scaling, variables with larger numeric ranges would dominate the PCA simply due to their scale, not their ecological importance. Standardization ensures all variables contribute equally to the analysis. Understanding variance explained helps us know how much information each PC captures and guides decisions about how many components to retain.
# Standardize the data (scale. = TRUE centers and scales variables)
pca_result <- prcomp(env_vars, scale. = TRUE)

# Summary of PCA
summary(pca_result)

# Variance explained by each PC (in %), used in the steps below
variance_explained <- 100 * pca_result$sdev^2 / sum(pca_result$sdev^2)
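A quick illustration, with invented numbers rather than the course data, of why `scale. = TRUE` matters: without it, the variable with the largest numeric range swallows PC1.

```r
set.seed(42)
toy <- data.frame(temp_C    = rnorm(50, mean = 15,   sd = 5),
                  precip_mm = rnorm(50, mean = 1500, sd = 600))

unscaled <- prcomp(toy)                 # covariance-based PCA
scaled   <- prcomp(toy, scale. = TRUE)  # correlation-based PCA

round(abs(unscaled$rotation[, "PC1"]), 3)  # precip_mm dominates (loading near 1)
round(abs(scaled$rotation[, "PC1"]), 3)    # both variables contribute
```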
3.4 Step 1.3: Determine Number of Axes to Retain
What this step does: This step helps us decide how many principal components to keep for further analysis using two complementary methods:
Scree Plot (Elbow Method): We plot the variance explained by each PC and look for where the curve “flattens out” - this is the elbow. Components before the elbow capture meaningful patterns; those after mostly capture noise.
Broken Stick Model: This creates a null expectation by asking “if variance were distributed randomly among components, how much would each get?” We only keep components that explain MORE variance than this random expectation.
Why we do this: Not all principal components are equally informative. The first few PCs capture major patterns in the data, while later PCs often represent noise or very minor variations. Retaining too many components defeats the purpose of dimensionality reduction and can introduce noise into downstream analyses. Retaining too few loses important information. These two methods provide objective, complementary criteria: the elbow method is visual and intuitive, while the broken stick model provides a statistical threshold. Together, they help us find the optimal balance between simplification and information retention.
Elbow Method: Look for the “elbow” in the scree plot where variance explained drops off
Broken Stick Model: Compare observed variance to a null model where variance is randomly distributed
# Broken stick model - by hand
broken_stick <- function(n) {
  result <- numeric(n)
  for (i in 1:n) {
    result[i] <- (1 / n) * sum(1 / (i:n))
  }
  return(result * 100)
}
bs_values <- broken_stick(20)

# Plot comparison
plot(1:20, variance_explained[1:20], type = "b", pch = 19, col = "blue",
     xlab = "Principal Component", ylab = "Variance Explained (%)",
     main = "PCA: Observed vs Broken Stick Model",
     ylim = c(0, max(variance_explained[1:20])))
lines(1:20, bs_values, type = "b", pch = 17, col = "red")
legend("topright", legend = c("Observed", "Broken Stick"),
       col = c("blue", "red"), pch = c(19, 17), lty = 1)
grid()
# Number of axes above broken stick
n_axes_retain <- sum(variance_explained > bs_values)
cat("\nNumber of PCs to retain (Broken Stick):", n_axes_retain)
Number of PCs to retain (Broken Stick): 3
3.5 Step 1.4: Interpret Variable Loadings
What this step does: Variable loadings tell us which environmental variables contribute most to each principal component. A loading is the correlation between the original variable and the PC axis. In this step, we extract the loadings for the first 3 PCs, print the complete loading matrix, and identify the top 5 contributing variables for each PC by sorting their absolute values.
Why we do this: Raw principal components are abstract mathematical constructs - they’re just weighted combinations of our original variables. Interpretation transforms these abstractions into ecologically meaningful environmental gradients. For example, if PC1 has high loadings from temperature variables, we can interpret it as a “temperature gradient.” If PC2 loads heavily on precipitation seasonality variables, it represents “precipitation variability.” This interpretation is crucial for understanding what ecological or climatic factors actually differentiate our sites and biomes. High positive or negative loadings (far from zero) indicate strong contributions to that axis.
# Extract loadings for first 3 PCs
loadings_pc <- pca_result$rotation[, 1:3]
print(round(loadings_pc, 3))
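The step description above mentions ranking variables by their absolute loadings; one way to sketch that is a small helper along these lines (the function name `top_loadings` is ours, and the example runs on R's built-in `USArrests` data rather than the climate set):

```r
# Report the n_top strongest contributors (by |loading|) to each retained axis
top_loadings <- function(pca, n_axes = 3, n_top = 5) {
  apply(pca$rotation[, 1:n_axes, drop = FALSE], 2,
        function(l) names(sort(abs(l), decreasing = TRUE))[1:n_top])
}

# Example on built-in data (only 4 variables, so n_top = 3)
top_loadings(prcomp(USArrests, scale. = TRUE), n_axes = 2, n_top = 3)
```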
3.6 Step 1.5: Visualize the PCA Ordination
What this step does: We visualize the PCA results by plotting the sites in the reduced ordination space. Each point represents one of our 8,000 sites, and we color points by their biome type. The code creates three 2D plots showing different combinations of principal components: PC1 vs PC2, PC1 vs PC3, and PC2 vs PC3.
Why we do this: Visualization is essential for understanding the relationships among sites and biomes in environmental space. Points that are close together have similar environmental conditions, while distant points differ substantially. By coloring points by biome, we can assess whether sites from the same biome cluster together (suggesting distinct environmental niches) or mix with other biomes (suggesting overlapping environmental tolerances). Different PC combinations reveal different aspects of environmental variation - PC1 vs PC2 usually shows the dominant gradients, while other combinations can reveal secondary patterns. The axis labels showing variance explained help you understand how much information each view contains.
3.7 Step 1.6: Add Convex Hulls
What this step does: Convex hulls are polygons that enclose all points belonging to each biome, like drawing a rubber band around each group. The chull() function finds the outermost points for each biome group, and polygon() draws lines connecting them. We draw hulls without fill (using col = NA) so you can see overlapping regions clearly.
Why we do this: Convex hulls help visualize three important ecological patterns:
1. The total environmental space occupied by each biome - larger hulls indicate biomes that span wider environmental gradients
2. How much different biomes overlap - overlapping hulls suggest similar environmental tolerances or transitional zones
3. Which biomes are environmentally distinct - non-overlapping hulls indicate biomes occupying unique environmental niches
This visualization is powerful for understanding niche differentiation and ecological segregation. Biomes with completely separated hulls occupy fundamentally different environmental spaces, while those with extensive overlap may be differentiated by factors beyond climate (such as soil, disturbance history, or biogeographic barriers) or may represent arbitrary divisions along continuous gradients.
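Before using chull() on the real scores, a minimal, self-contained look at what it returns (five made-up points; three lie on the hull, two inside):

```r
pts_x <- c(0, 2, 1, 1.0, 0.5)
pts_y <- c(0, 0, 2, 0.7, 0.5)

hull_idx <- chull(pts_x, pts_y)
hull_idx  # indices of the boundary points (here the first three points)

# On an open plot, polygon(pts_x[hull_idx], pts_y[hull_idx]) would draw the hull
```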
# Function to add convex hulls
add_hulls <- function(scores_x, scores_y, groups, colors) {
  for (i in seq_along(unique(groups))) {
    group <- unique(groups)[i]
    subset_idx <- which(groups == group)
    hull_idx <- chull(scores_x[subset_idx], scores_y[subset_idx])
    hull_points <- subset_idx[hull_idx]
    polygon(scores_x[hull_points], scores_y[hull_points],
            border = colors[i], lwd = 2, col = NA)
  }
}

# Plot with convex hulls
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# PC1 vs PC2 with hulls
plot(pc_scores[, 1], pc_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC2 (", round(variance_explained[2], 1), "%)"),
     main = "PCA with Convex Hulls: PC1 vs PC2")
add_hulls(pc_scores[, 1], pc_scores[, 2],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))], colors)

# PC1 vs PC3 with hulls
plot(pc_scores[, 1], pc_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(variance_explained[1], 1), "%)"),
     ylab = paste0("PC3 (", round(variance_explained[3], 1), "%)"),
     main = "PCA with Convex Hulls: PC1 vs PC3")
add_hulls(pc_scores[, 1], pc_scores[, 3],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))], colors)

# Add legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)
3.8 Reflection Questions - Task 1
Question 1.1: Based on the scree plot and broken stick model, how many principal components would you retain for further analysis? Explain your reasoning and discuss what percentage of the total variance is explained by these components.
Question 1.2: Examine the variable loadings for the first two principal components. What environmental gradients do PC1 and PC2 appear to represent? Which bioclimatic variables contribute most strongly to each axis, and what does this tell you about the main environmental factors differentiating these biomes?
4 Task 2: Principal Coordinates Analysis (PCoA)
4.1 Background and Objectives
What is PCoA?
Principal Coordinates Analysis (also called metric multidimensional scaling) is an ordination technique that works with distance or dissimilarity matrices rather than raw data. Unlike PCA, which assumes linear relationships, PCoA can handle non-linear relationships through the choice of dissimilarity measure.
Why use Euclidean distance?
Euclidean distance is the straight-line distance between two points in multidimensional space. It’s appropriate for continuous environmental data and, when used with scaled variables, produces results similar to PCA but through a distance-based framework. This allows us to understand how distance-based ordination methods work while maintaining comparability with our PCA results.
What we will do:
Scale environmental variables to ensure equal contribution
Calculate Euclidean distance matrix
Perform PCoA on the distance matrix
Determine how many PCo axes to retain
Visualize the ordination colored by biome
Add convex hulls to delineate biome spaces
4.2 Step 2.1: Scale Environmental Data
What this step does: We standardize each environmental variable to have mean = 0 and standard deviation = 1 using the scale() function. This transformation ensures that all variables are on comparable scales before calculating distances.
Why we do this: Just like in PCA, scaling is crucial when working with variables measured in different units and ranges. Without scaling, variables with larger numeric ranges (like precipitation in mm) would dominate the distance calculations over variables with smaller ranges (like temperature in °C). Scaling ensures that each environmental variable contributes proportionally to the distance matrix based on its ecological variation, not its measurement units. This prevents arbitrary scale differences from distorting our understanding of ecological relationships.
# Scale environmental variables
env_scaled <- scale(env_vars)

# Check the transformation
cat("Original data - first 5 rows and columns:\n")
4.3 Step 2.2: Calculate Euclidean Distance Matrix
What this step does: We calculate the Euclidean distance between every pair of sites using the scaled environmental data. The Euclidean distance between two sites is the square root of the sum of squared differences across all environmental variables. R’s dist() function efficiently computes this for all pairwise combinations.
Why we do this: The Euclidean distance matrix quantifies how environmentally similar or different each pair of sites is. Sites with small distances have similar environmental conditions across all climate variables; sites with large distances have very different conditions. This distance matrix becomes the input for PCoA, which will find a coordinate system that best preserves these distances. Unlike PCA which works directly with the data matrix, distance-based methods like PCoA offer flexibility in how we measure similarity - we could use other distance metrics for different data types or ecological questions.
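To make the formula concrete, here is a two-site check with invented values, confirming that dist() matches the hand calculation:

```r
a <- c(1, 2, 3)
b <- c(4, 6, 3)

manual   <- sqrt(sum((a - b)^2))         # sqrt(3^2 + 4^2 + 0^2) = 5
via_dist <- as.numeric(dist(rbind(a, b)))

c(manual = manual, via_dist = via_dist)  # both equal 5
```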
# Calculate the Euclidean distance matrix on the scaled data
euclidean_dist <- dist(env_scaled, method = "euclidean")

# Convert to matrix for inspection
dist_matrix <- as.matrix(euclidean_dist)

# Show example distances between first few sites
cat("\nExample: Distances between first 5 sites:\n")
Distance matrix summary:
Dimensions: 1598 x 1598
Min distance: 0.035
Max distance: 20.283
Mean distance: 5.727
Median distance: 5.282
4.4 Step 2.3: Perform PCoA
What this step does: Principal Coordinates Analysis (PCoA) takes our distance matrix and finds a coordinate system where the Euclidean distances between points best match the original environmental distances. We use the cmdscale() function (classical multidimensional scaling) to perform PCoA, requesting 20 dimensions and eigenvalues.
Why we do this: PCoA transforms the distance matrix into an ordination space where we can visualize relationships among sites. It finds axes (principal coordinates) that maximize the correspondence between distances in the ordination and the original environmental distances. Unlike PCA which decomposes variance in the data matrix, PCoA decomposes distances. However, when using Euclidean distance on scaled data, PCoA produces results mathematically equivalent to PCA - this demonstrates the connection between these two fundamental ordination methods. PCoA’s distance-based framework also makes it more flexible for non-Euclidean distances used with other data types.
# Perform PCoA using cmdscale (classical MDS): request 20 axes and eigenvalues
cat("Performing PCoA...\n")
pcoa_result <- cmdscale(euclidean_dist, k = 20, eig = TRUE)

# Variance explained per axis (%), computed from the eigenvalues
variance_exp_pcoa <- 100 * pcoa_result$eig / sum(pcoa_result$eig)
cumulative_var_pcoa <- cumsum(variance_exp_pcoa)
cat("\nNote: With Euclidean distance on scaled data, PCoA results are")
Note: With Euclidean distance on scaled data, PCoA results are
cat("\nmathematically equivalent to PCA. Compare these values to your PCA results!\n")
mathematically equivalent to PCA. Compare these values to your PCA results!
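The claimed equivalence can be checked on any dataset; here is a sketch using R's built-in USArrests data (not the climate set): cmdscale() on Euclidean distances of scaled variables reproduces the prcomp() scores up to per-axis sign flips.

```r
X <- scale(USArrests)

pca_scores  <- prcomp(X)$x[, 1:2]
pcoa_scores <- cmdscale(dist(X), k = 2)

# Agreement up to axis reflections (sign flips): effectively zero
max(abs(abs(pca_scores) - abs(pcoa_scores)))
```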
4.5 Step 2.4: Determine Number of Axes to Retain
What this step does: We use the same two methods as in PCA to determine how many PCoA axes to retain: the scree plot (elbow method) to visualize where variance levels off, and the broken stick model to identify axes that explain more variance than expected by chance.
Why we do this: Not all principal coordinates are equally informative. The first few axes capture the major environmental gradients differentiating sites, while later axes represent progressively smaller patterns or noise. The broken stick model provides an objective statistical threshold - only axes exceeding this null expectation represent meaningful patterns rather than random variation. This step ensures we focus on the most important axes for downstream analyses like clustering, avoiding the inclusion of noise that could obscure true patterns. The scree plot gives visual intuition, while the broken stick provides quantitative guidance.
par(mfrow = c(1, 1))

# Scree plot
plot(1:20, variance_exp_pcoa[1:20], type = "b", pch = 19, col = "darkgreen",
     xlab = "Principal Coordinate", ylab = "Variance Explained (%)",
     main = "PCoA Scree Plot")
grid()
# Broken stick comparison
bs_values_pcoa <- broken_stick(20)
plot(1:20, variance_exp_pcoa[1:20], type = "b", pch = 19, col = "darkgreen",
     xlab = "Principal Coordinate", ylab = "Variance Explained (%)",
     main = "PCoA: Observed vs Broken Stick Model",
     ylim = c(0, max(variance_exp_pcoa[1:20])))
lines(1:20, bs_values_pcoa, type = "b", pch = 17, col = "red")
legend("topright", legend = c("Observed", "Broken Stick"),
       col = c("darkgreen", "red"), pch = c(19, 17), lty = 1)
grid()
n_axes_pcoa <- sum(variance_exp_pcoa[1:20] > bs_values_pcoa)
cat("\nNumber of PCo axes to retain (Broken Stick):", n_axes_pcoa, "\n")
Number of PCo axes to retain (Broken Stick): 3
cat("Cumulative variance explained by retained axes:", round(cumulative_var_pcoa[n_axes_pcoa], 2), "%\n")
Cumulative variance explained by retained axes: 79.96 %
4.6 Step 2.5: Visualize PCoA Ordination
What this step does: We visualize the PCoA ordination space by plotting sites using their coordinates on different pairs of PCo axes, with points colored by biome type. We create three different 2D views: PCo1 vs PCo2, PCo1 vs PCo3, and PCo2 vs PCo3.
Why we do this: Visualization reveals the spatial arrangement of sites and biomes in environmental space. Points positioned close together have similar environmental conditions (small Euclidean distances), while distant points differ substantially in their climate profiles. By coloring points by biome, we can assess whether sites from the same biome cluster together in environmental space - indicating distinct climatic niches - or intermix with other biomes - suggesting overlapping environmental tolerances or transitional zones. Different axis combinations reveal different aspects of environmental variation: PCo1 vs PCo2 typically shows the dominant gradients (temperature, precipitation), while other combinations can reveal secondary patterns. The variance percentages on axis labels help you judge how much ecological information each view contains.
cat("\nNote: Because we used Euclidean distance on scaled data, these plots")
Note: Because we used Euclidean distance on scaled data, these plots
cat("\nshould look very similar to your PCA plots (possibly reflected or rotated).\n")
should look very similar to your PCA plots (possibly reflected or rotated).
4.7 Reflection Questions - Task 2
Question 2.1: Compare the PCoA ordination to the PCA ordination. Since we used Euclidean distance on scaled data, the results should be mathematically equivalent (though possibly reflected or rotated). Do you observe this equivalence? What does this tell you about the relationship between PCA and distance-based ordination methods?
Question 2.2: Looking at the convex hulls in the PCoA plots, which biomes show the most overlap in environmental space? Which are most distinct? What might this tell you about the environmental specificity of different vegetation types? Consider both the size of the hulls (environmental breadth) and their positions (environmental optima).
5 Task 3: Non-metric Multidimensional Scaling (NMDS)
5.1 Background and Objectives
What is NMDS?
Non-metric Multidimensional Scaling is an iterative ordination technique that attempts to represent dissimilarities between objects in a low-dimensional space. Unlike PCA and PCoA, NMDS uses rank orders rather than absolute distances, making it robust to non-linear relationships.
Understanding Stress:
Stress measures how well the ordination distances match the original dissimilarities. Lower stress indicates better fit:
Stress < 0.05: Excellent representation
Stress < 0.10: Good representation
Stress < 0.20: Acceptable
Stress > 0.20: Poor representation (use more dimensions)
What we will do:
Load the vegan package for robust NMDS implementation
Perform NMDS with different numbers of dimensions (2, 3, 4, 5)
Plot how stress changes with dimensionality
Select the optimal number of dimensions
Visualize the NMDS ordination colored by biome
Add convex hulls to show biome distributions
5.2 Step 3.1: Run NMDS with Different Dimensions
What this step does: We run NMDS multiple times using 2, 3, 4, and 5 dimensions. For each dimensionality, metaMDS() performs multiple random starts (trymax = 20) to find the best solution, iterating until stress stabilizes. The function returns the configuration (site positions) and stress value for each dimensionality.
Why we do this: Unlike PCA/PCoA where the number of meaningful axes is determined by the data’s variance structure, in NMDS you must specify the number of dimensions beforehand. Different dimensionalities trade off between simplicity and goodness-of-fit. Testing multiple dimensions helps us find the optimal balance: too few dimensions yield high stress (poor fit), while too many dimensions add complexity without meaningful improvement. Multiple random starts are crucial because NMDS optimization can get stuck in local minima - metaMDS() runs many attempts and keeps the best solution, ensuring we find a global (or near-global) optimum.
# Run NMDS with different numbers of dimensions
dimensions <- c(2, 3, 4, 5)
nmds_results <- list()
stress_values <- numeric(length(dimensions))

cat("Running NMDS with different numbers of dimensions...\n",
    "(Each run performs multiple random starts - this may take a moment)\n\n",
    "This will take a long time\n")
Running NMDS with different numbers of dimensions...
(Each run performs multiple random starts - this may take a moment)
This will take a long time
for (i in seq_along(dimensions)) {
  k <- dimensions[i]
  cat("Running NMDS with k =", k, "dimensions...\n")

  # Run metaMDS with specified dimensions
  # trace = FALSE suppresses iteration output
  # autotransform = FALSE because we already scaled our data
  # trymax = 20 performs up to 20 random starts
  nmds_results[[i]] <- metaMDS(euclidean_dist, k = k,
                               trace = FALSE, autotransform = FALSE,
                               trymax = 20)
  stress_values[i] <- nmds_results[[i]]$stress
  cat("  Final stress =", round(stress_values[i], 4), "\n")

  # Interpret stress
  if (stress_values[i] < 0.05) {
    cat("  Interpretation: Excellent representation\n\n")
  } else if (stress_values[i] < 0.10) {
    cat("  Interpretation: Good representation\n\n")
  } else if (stress_values[i] < 0.20) {
    cat("  Interpretation: Acceptable representation\n\n")
  } else {
    cat("  Interpretation: Poor representation - consider more dimensions\n\n")
  }
}
Running NMDS with k = 2 dimensions...
Final stress = 0.1088
Interpretation: Acceptable representation
Running NMDS with k = 3 dimensions...
Final stress = 0.0664
Interpretation: Good representation
Running NMDS with k = 4 dimensions...
Final stress = 0.0423
Interpretation: Excellent representation
Running NMDS with k = 5 dimensions...
Final stress = 0.0246
Interpretation: Excellent representation
5.3 Step 3.2: Plot Stress vs Number of Dimensions
What this step does: We create a stress plot showing how goodness-of-fit changes with dimensionality. The plot includes reference lines at stress = 0.05, 0.10, and 0.20 to help interpret fit quality. We also identify the minimum number of dimensions needed to achieve acceptable stress (< 0.20).
Why we do this: The stress plot reveals the trade-off between model complexity and fit quality. It helps us answer: “How many dimensions do we need?” Key considerations:
Diminishing returns: After a certain point, additional dimensions improve stress only marginally
Interpretability: 2-3 dimensions are easy to visualize; higher dimensions are harder to interpret
Overfitting: Too many dimensions can fit noise rather than signal
The standard thresholds (0.05, 0.10, 0.20) come from decades of ecological practice. Stress < 0.10 is generally considered good for reliable interpretation, while stress > 0.20 suggests the ordination may be misleading. The optimal choice balances statistical fit with practical interpretability - often this means accepting slightly higher stress to gain easier visualization in 2-3 dimensions.
# Create stress plot
plot(dimensions, stress_values,
     type = "b", pch = 19, col = "purple", lwd = 2,
     xlab = "Number of Dimensions",
     ylab = "Stress",
     main = "NMDS: Stress vs Dimensionality",
     ylim = c(0, max(stress_values) * 1.1))

# Add reference lines
abline(h = 0.20, lty = 2, col = "red", lwd = 2)
abline(h = 0.10, lty = 2, col = "orange", lwd = 2)
abline(h = 0.05, lty = 2, col = "green", lwd = 2)

# Add labels
text(max(dimensions), 0.20, "Acceptable (0.20)", pos = 2, col = "red", cex = 0.9)
text(max(dimensions), 0.10, "Good (0.10)", pos = 2, col = "orange", cex = 0.9)
text(max(dimensions), 0.05, "Excellent (0.05)", pos = 2, col = "green", cex = 0.9)
grid()
# Determine optimal dimensions
if (any(stress_values < 0.20)) {
  optimal_k <- dimensions[which(stress_values < 0.20)[1]]
  cat("\nOptimal number of dimensions (stress < 0.20):", optimal_k, "\n")
} else {
  cat("\nWarning: No dimensionality achieved stress < 0.20\n")
  cat("Consider testing higher dimensions or using a different distance metric.\n")
  optimal_k <- dimensions[which.min(stress_values)]
  cat("Using k =", optimal_k, "with lowest stress =", round(min(stress_values), 4), "\n")
}
Optimal number of dimensions (stress < 0.20): 2
cat("\nStress values summary:\n")
Stress values summary:
for (i in seq_along(dimensions)) {
  cat("  k =", dimensions[i], ": stress =", round(stress_values[i], 4), "\n")
}
k = 2 : stress = 0.1088
k = 3 : stress = 0.0664
k = 4 : stress = 0.0423
k = 5 : stress = 0.0246
5.4 Step 3.3: Visualize NMDS Ordination
What this step does: We visualize the NMDS solution using the 5-dimensional configuration, the lowest-stress run from Step 3.1. We plot the first three axes in three different 2D combinations: NMDS1 vs NMDS2, NMDS1 vs NMDS3, and NMDS2 vs NMDS3, with points colored by biome type.
Why we do this: Visualization reveals whether NMDS successfully arranges sites so that environmentally similar locations are close together and dissimilar ones are far apart. By coloring points by biome, we assess whether NMDS ordination space reflects ecological classification - strong clustering by biome suggests the climate variables that drive NMDS axes also drive vegetation patterns.
Important differences from PCA/PCoA:
1. No variance explained: NMDS axes don’t capture proportions of variance - all axes are equally important
2. Arbitrary rotation: The orientation and flip of NMDS axes is arbitrary; only relative positions matter
3. Rank-based: NMDS preserves rank order of dissimilarities, not absolute distances, making it robust to non-linear relationships
The stress value shown in the plot title indicates trustworthiness - lower stress means the 2D projection accurately represents the multidimensional relationships. Multiple views (different axis pairs) show different aspects of the ordination space.
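The “arbitrary rotation” point can be verified directly: reflecting (or rotating) a configuration leaves every inter-point distance, and hence the stress, unchanged. A tiny stand-alone check with made-up coordinates:

```r
set.seed(7)
conf <- matrix(rnorm(20), ncol = 2)   # a made-up 10-point configuration

flipped <- conf %*% diag(c(-1, 1))    # reflect along axis 1

# All pairwise distances are identical, so stress would be too
all.equal(as.numeric(dist(conf)), as.numeric(dist(flipped)))  # TRUE
```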
# Select the NMDS result with 5 dimensions for visualization
# (or use optimal_k if it's different)
nmds_k5 <- nmds_results[[which(dimensions == 5)]]
nmds_scores <- nmds_k5$points

# Create visualizations
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# NMDS1 vs NMDS2
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1", ylab = "NMDS2",
     main = paste0("NMDS: Axis 1 vs 2 (Stress = ", round(nmds_k5$stress, 3), ")"))

# NMDS1 vs NMDS3
plot(nmds_scores[, 1], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1", ylab = "NMDS3",
     main = "NMDS: Axis 1 vs 3")

# NMDS2 vs NMDS3
plot(nmds_scores[, 2], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS2", ylab = "NMDS3",
     main = "NMDS: Axis 2 vs 3")

# Legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)
cat("\nNote: NMDS axes don't have inherent meaning or variance values.\n")
Note: NMDS axes don't have inherent meaning or variance values.
cat("All axes are equally valid - they represent different dimensions of the dissimilarity space.\n")
All axes are equally valid - they represent different dimensions of the dissimilarity space.
cat("The orientation (rotation/reflection) of the plot is arbitrary.\n")
The orientation (rotation/reflection) of the plot is arbitrary.
5.5 Step 3.4: Add Convex Hulls to NMDS
What this step does: We add convex hulls (polygons enclosing all points) around each biome in the NMDS ordination space. The hulls are drawn without fill to clearly show overlapping regions between different biomes.
Why we do this: Convex hulls reveal important ecological patterns in ordination space:
Biome distinctiveness: Non-overlapping hulls indicate biomes with unique environmental signatures that NMDS successfully separates; overlapping hulls suggest similar environmental conditions or transitional zones
Environmental breadth: Large hulls show biomes spanning wide environmental gradients (generalists); small hulls indicate narrow environmental tolerances (specialists)
Niche relationships: The spatial arrangement of hulls shows which biomes occupy similar vs. different environmental niches
Comparing across methods: By now you’ve created hull plots for all three ordination techniques (PCA, PCoA, NMDS). Comparing them reveals:
- Do all methods separate biomes similarly? This suggests robust, method-independent patterns
- Do some biomes separate better in one method? PCA captures linear gradients, PCoA preserves distances, NMDS preserves rank orders - different biome-environment relationships favor different methods
- Which method shows clearest separation overall? This suggests which mathematical framework best matches your data structure
Differences between methods illuminate whether biome-climate relationships are primarily linear (PCA works best), distance-based (PCoA works best), or rank-ordered (NMDS works best).
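The `add_hulls()` helper was defined earlier in the practical and is reused in the code for this step. If you need to recreate it, a minimal version built on base R's `chull()` might look like the sketch below; the argument order matches how it is called here, but this is a reconstruction, not necessarily the original definition.

```r
# Sketch of a convex-hull helper built on chull(); a reconstruction of the
# add_hulls() helper used in this step, not necessarily its original code.
add_hulls <- function(x, y, groups, colors) {
  group_levels <- levels(factor(groups))
  for (i in seq_along(group_levels)) {
    idx <- which(groups == group_levels[i])
    if (length(idx) >= 3) {                       # a hull needs >= 3 points
      hull <- idx[chull(x[idx], y[idx])]          # indices of hull vertices
      polygon(x[hull], y[hull], border = colors[i], lwd = 2)  # outline, no fill
    }
  }
}
```

Drawing the hulls with `border` only (no `col` fill) keeps overlapping regions between biomes visible.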
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# NMDS1 vs NMDS2 with hulls
plot(nmds_scores[, 1], nmds_scores[, 2],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS2",
     main = "NMDS with Convex Hulls: Axis 1 vs 2")
add_hulls(nmds_scores[, 1], nmds_scores[, 2],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))], colors)

# NMDS1 vs NMDS3 with hulls
plot(nmds_scores[, 1], nmds_scores[, 3],
     col = color_map[full_data$BIOME.Name],
     pch = 16, cex = 0.6,
     xlab = "NMDS1",
     ylab = "NMDS3",
     main = "NMDS with Convex Hulls: Axis 1 vs 3")
add_hulls(nmds_scores[, 1], nmds_scores[, 3],
          full_data$BIOME.Name[as.numeric(row.names(env_vars))], colors)

# Legend
plot.new()
legend("topright", legend = biome_types, col = colors, pch = 16, cex = 1)

cat("\nCompare these convex hulls to your PCA and PCoA hulls from Tasks 1 and 2.\n")
Compare these convex hulls to your PCA and PCoA hulls from Tasks 1 and 2.
cat("Similarities suggest robust patterns; differences reveal which method best captures biome-environment relationships.\n")
Similarities suggest robust patterns; differences reveal which method best captures biome-environment relationships.
5.6 Reflection Questions - Task 3
Question 3.1: Based on the stress values you calculated, which number of dimensions provides the best balance between model complexity and goodness-of-fit? Would you consider the representation with your chosen number of dimensions to be excellent, good, or acceptable according to standard stress guidelines? Explain your reasoning.
Question 3.2: Compare the three ordination methods (PCA, PCoA, and NMDS) in terms of how clearly they separate the different biomes. Which method appears most effective for distinguishing vegetation types? Consider both the visual separation in ordination space and the conceptual differences between methods (linear vs. distance-based vs. rank-based). Why might different ordination techniques produce different patterns, and what does this tell you about the nature of biome-climate relationships?
6 Task 4: K-means Clustering
6.1 What is K-means?
K-means is a partitioning method that groups objects into a pre-specified number of clusters (k) by minimizing within-cluster variation. The algorithm iteratively assigns objects to the nearest cluster centroid and recalculates centroids until convergence.
Why use ordination spaces for clustering?
Ordination reduces noise and collinearity in the data, making patterns more distinct. Clustering in ordination space often produces more interpretable and stable groups than clustering in the original high-dimensional space.
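The quantities K-means reports can be illustrated on a tiny, self-contained example (toy data, not the practical's datasets):

```r
# Toy K-means run on two well-separated point clouds.
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # cloud 1
             matrix(rnorm(100, mean = 5), ncol = 2))   # cloud 2
km <- kmeans(pts, centers = 2, nstart = 25)

km$tot.withinss                     # compactness: lower is better
km$betweenss                        # separation: higher is better
round(km$betweenss / km$totss, 3)   # proportion of variance explained by clusters
```

With clearly separated groups like these, the between/total ratio approaches 1; the `nstart = 25` argument reruns the algorithm from 25 random starts and keeps the best solution, which guards against poor local optima.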
What we will do:
Apply K-means clustering to PCA, PCoA, and NMDS ordination spaces
Compare classifications across different ordination methods
Determine the optimal number of clusters for PCoA using multiple criteria
Test whether clusters are statistically different using PERMANOVA
Assess classification accuracy relative to known biome types
6.2 Step 4.1: K-means on All Three Ordination Spaces
What this step does: This step applies K-means clustering to each of the three ordination spaces (PCA, PCoA, and NMDS) we created in previous tasks. We set the number of clusters equal to the number of known biomes in our dataset to see how well unsupervised clustering can recover the known ecological classifications.
Why we do this: By comparing clustering across different ordination methods, we can assess which ordination technique best captures the natural groupings in the climate-vegetation data. Each ordination method has different properties (PCA assumes linear relationships, PCoA preserves distances, NMDS preserves rank orders), so they may produce different clustering results.
Key outputs:
- Total within-cluster SS: Measures how compact the clusters are (lower is better)
- Between-cluster SS: Measures how separated the clusters are (higher is better)
- Ratio (between/total): The proportion of variance explained by the clustering (higher values indicate better-defined clusters)
# Number of known biomes
n_biomes <- length(unique(full_data$BIOME.Name))
cat("Number of biomes in dataset:", n_biomes, "\n")
Number of biomes in dataset: 16
# K-means on PCA space (using retained axes)
set.seed(42)
kmeans_pca <- kmeans(pc_scores[, 1:n_axes_retain], centers = n_biomes, nstart = 25)

# K-means on PCoA space
set.seed(42)
kmeans_pcoa <- kmeans(pcoa_scores[, 1:n_axes_pcoa], centers = n_biomes, nstart = 25)

# K-means on NMDS space (using 5 dimensions)
set.seed(42)
kmeans_nmds <- kmeans(nmds_scores, centers = n_biomes, nstart = 25)

# Summary of results
cat("\n--- K-means Results ---\n")
--- K-means Results ---
cat("PCA - Total within-cluster SS:", round(kmeans_pca$tot.withinss, 2), "\n")
cat("NMDS - Ratio (between/total):", round(kmeans_nmds$betweenss / kmeans_nmds$totss, 3), "\n")
NMDS - Ratio (between/total): 0.86
6.3 Step 4.2: Visualize K-means Classifications
What this step does: Creates side-by-side visualizations showing the original biome classifications (left panels) and the K-means cluster assignments (right panels) for each ordination method. Black asterisks mark the cluster centroids (the center points of each cluster).
Why we do this: Visual comparison allows us to quickly assess how well the unsupervised K-means clustering recovers the known biome patterns. Good clustering should show distinct, separated groups that align reasonably well with the original biome classifications.
What to look for:
- Do the K-means clusters form distinct groups or overlap extensively?
- Are the cluster centroids well-separated from each other?
- Do the clustering patterns look similar across the three ordination methods?
- Where do K-means clusters align with biome boundaries, and where do they differ?
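The plotting code for this step is not reproduced in this excerpt. A minimal, self-contained sketch of the side-by-side layout follows, on toy scores; in the practical you would substitute an actual ordination (e.g. `pc_scores`) and its clustering (e.g. `kmeans_pca`).

```r
# Toy sketch: known groups (left) versus K-means assignments (right),
# with centroids marked by black asterisks (pch = 8).
set.seed(42)
scores <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                matrix(rnorm(100, mean = 4), ncol = 2))
true_groups <- factor(rep(c("Group A", "Group B"), each = 50))
km <- kmeans(scores, centers = 2, nstart = 25)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(scores, col = as.numeric(true_groups), pch = 16,
     xlab = "Axis 1", ylab = "Axis 2", main = "Original groups")
plot(scores, col = km$cluster, pch = 16,
     xlab = "Axis 1", ylab = "Axis 2", main = "K-means clusters")
points(km$centers, pch = 8, cex = 2)   # cluster centroids
```

Because K-means labels clusters arbitrarily (cluster 1 need not correspond to group A), compare the spatial pattern of the two panels rather than the colors themselves.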
6.4 Step 4.3: Determine Optimal Number of Clusters (PCoA space)
What this step does: Tests different numbers of clusters (k = 2 to 15) using two methods to find the “natural” number of groups in the data:
1. Elbow method: Plots how within-cluster variation decreases as k increases; the “elbow” (point where improvement slows) suggests optimal k
2. Silhouette method: Calculates how well each point fits its assigned cluster compared to other clusters; values range from -1 (poor fit) to +1 (excellent fit)
Why we do this: The true number of ecological groups may differ from the number of named biomes. These methods help us discover the number of clusters that best fits the underlying data structure without being constrained by pre-existing classification schemes.
What to look for:
- In the elbow plot: Where does the curve flatten out?
- In the silhouette plot: Which k value gives the highest average silhouette score?
- Do these methods agree on an optimal k?
# Test different numbers of clusters
k_values <- 2:15
wss <- numeric(length(k_values))
silhouette_scores <- numeric(length(k_values))

set.seed(42)
for (i in seq_along(k_values)) {
  k <- k_values[i]
  km_temp <- kmeans(pcoa_scores[, 1:n_axes_pcoa], centers = k, nstart = 25)
  wss[i] <- km_temp$tot.withinss

  # Calculate silhouette score
  cluster_assignments <- km_temp$cluster
  sil_scores <- numeric(nrow(pcoa_scores))
  for (j in 1:nrow(pcoa_scores)) {
    # Calculate average distance to own cluster
    own_cluster <- cluster_assignments[j]
    own_cluster_points <- pcoa_scores[cluster_assignments == own_cluster, 1:n_axes_pcoa]
    a_i <- mean(sqrt(rowSums((sweep(own_cluster_points, 2, pcoa_scores[j, 1:n_axes_pcoa]))^2)))

    # Calculate average distance to nearest cluster
    other_clusters <- unique(cluster_assignments[cluster_assignments != own_cluster])
    b_values <- numeric(length(other_clusters))
    for (c_idx in seq_along(other_clusters)) {
      other_cluster_points <- pcoa_scores[cluster_assignments == other_clusters[c_idx], 1:n_axes_pcoa]
      b_values[c_idx] <- mean(sqrt(rowSums((sweep(other_cluster_points, 2, pcoa_scores[j, 1:n_axes_pcoa]))^2)))
    }
    b_i <- min(b_values)
    sil_scores[j] <- (b_i - a_i) / max(a_i, b_i)
  }
  silhouette_scores[i] <- mean(sil_scores)
}

# Plot results
par(mfrow = c(1, 2), mar = c(4, 4, 3, 2))

# Elbow plot
plot(k_values, wss,
     type = "b", pch = 19, col = "blue",
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method")
grid()

# Silhouette plot
plot(k_values, silhouette_scores,
     type = "b", pch = 19, col = "darkgreen",
     xlab = "Number of Clusters (k)",
     ylab = "Average Silhouette Score",
     main = "Silhouette Method")
abline(h = 0, lty = 2, col = "red")
grid()
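The silhouette loop above computes everything by hand for transparency. The same averages can be obtained far more concisely with `silhouette()` from the recommended cluster package; the sketch below uses toy data, and in the practical you would substitute `pcoa_scores[, 1:n_axes_pcoa]`.

```r
# Concise alternative to the hand-rolled silhouette loop, using
# cluster::silhouette() (toy data; substitute the PCoA scores in practice).
library(cluster)

set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
d <- dist(pts)

k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(pts, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
k_values[which.max(avg_sil)]   # best-supported k for this toy data
```

Computing `dist()` once outside the loop avoids repeating the most expensive step for every candidate k.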
# Identify optimal k
optimal_k_silhouette <- k_values[which.max(silhouette_scores)]
cat("\nOptimal number of clusters (Silhouette):", optimal_k_silhouette, "\n")
6.5 Step 4.4: Run K-means with Optimal Number of Clusters
What this step does: Performs K-means clustering using the optimal number of clusters identified in Step 4.3. Creates visualizations comparing this data-driven clustering approach with the original biome classifications, and generates a confusion matrix showing how cluster assignments correspond to actual biomes.
Why we do this: This allows us to compare a classification based purely on the data’s natural structure versus the traditional expert-defined biome system. The confusion matrix reveals which biomes are most distinct and which tend to blend together based on climate variables.
What to look for:
- How does the optimal k compare to the number of original biomes?
- In the confusion matrix: Are certain biomes consistently assigned to single clusters, or are they split across multiple clusters?
- Do some clusters contain sites from multiple different biomes?
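The code for this step is not shown in this excerpt; in outline, it reruns `kmeans()` with the data-driven k and cross-tabulates the resulting `kmeans_optimal` clusters against the known biomes. A self-contained toy sketch follows; in the practical, `scores` would be `pcoa_scores[, 1:n_axes_pcoa]` and `k_opt` would be `optimal_k_silhouette`.

```r
# Toy sketch of Step 4.4: K-means with a data-driven k, then a confusion
# matrix of cluster assignments against the known labels.
set.seed(42)
scores <- rbind(matrix(rnorm(120, mean = 0), ncol = 3),
                matrix(rnorm(120, mean = 4), ncol = 3))
true_biome <- rep(c("Forest", "Grassland"), each = 40)
k_opt <- 2   # stands in for optimal_k_silhouette

kmeans_optimal <- kmeans(scores, centers = k_opt, nstart = 25)
confusion <- table(Cluster = kmeans_optimal$cluster, Biome = true_biome)
print(confusion)
```

In the confusion matrix, a row whose counts spread across several columns is a cluster that mixes biomes; a column spread across several rows is a biome split between clusters.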
What this step does: Calculates how accurately each clustering method recovers the original biome classifications. For each cluster, the code identifies which biome is most common in that cluster and assigns all members to that biome. Accuracy is the percentage of sites correctly classified.
Why we do this: This quantifies the agreement between unsupervised clustering (based purely on climate patterns) and expert-defined biomes. High accuracy suggests biomes are climatically distinct; lower accuracy might indicate that biome boundaries are influenced by factors beyond climate, or that biomes exist along continuous gradients rather than as discrete categories.
What to look for:
- Which ordination method produces the most accurate clustering?
- How does accuracy change with the optimal k versus the original number of biomes?
- Which biomes are most consistently assigned to single clusters?
- Which biomes are frequently confused with each other?
# Calculate classification accuracy for each method
calculate_accuracy <- function(clusters, true_biomes) {
  # Find best matching between clusters and biomes
  confusion <- table(clusters, true_biomes)

  # For each cluster, assign it to the most common biome
  cluster_to_biome <- apply(confusion, 1, function(x) colnames(confusion)[which.max(x)])

  # Calculate accuracy
  predicted_biomes <- cluster_to_biome[as.character(clusters)]
  accuracy <- sum(predicted_biomes == true_biomes) / length(true_biomes)

  return(list(accuracy = accuracy,
              cluster_assignment = cluster_to_biome,
              confusion = confusion))
}

# Calculate accuracies
acc_pca <- calculate_accuracy(kmeans_pca$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_pcoa <- calculate_accuracy(kmeans_pcoa$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_nmds <- calculate_accuracy(kmeans_nmds$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])
acc_optimal <- calculate_accuracy(kmeans_optimal$cluster, full_data$BIOME.Name[as.numeric(row.names(env_vars))])

cat("\n--- Classification Accuracy ---\n")
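To see exactly what the majority-vote logic inside `calculate_accuracy()` does, here is a hand-checkable toy run of the same steps (the labels are hypothetical):

```r
# The majority-vote accuracy logic from calculate_accuracy(), on a toy example.
clusters <- c(1, 1, 2, 2, 2)
true_biomes <- c("A", "A", "B", "B", "A")

confusion <- table(clusters, true_biomes)
# cluster 1 is mostly "A", cluster 2 is mostly "B"
cluster_to_biome <- apply(confusion, 1, function(x) colnames(confusion)[which.max(x)])
predicted <- cluster_to_biome[as.character(clusters)]

accuracy <- sum(predicted == true_biomes) / length(true_biomes)
accuracy   # 4 of the 5 sites match their cluster's majority label: 0.8
```

The fifth site is an "A" sitting in the mostly-"B" cluster, so it counts as misclassified; this is exactly how sites in climatically transitional biomes lower the accuracy scores.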
Question 4.1: Compare the classification accuracies across the three ordination methods (PCA, PCoA, NMDS). Which ordination space produced the most accurate K-means clustering? Why might certain ordination methods be more suitable for classification tasks than others?
Question 4.2: You determined an optimal number of clusters using the silhouette method, which may differ from the actual number of biomes in the dataset. What does this suggest about the natural structure of vegetation types based on climate data? Discuss whether biomes are truly discrete categories or whether they represent points along continuous environmental gradients. How might this impact conservation and vegetation mapping efforts?
7 Summary and Conclusions
7.1 Key Findings
Today you have learned to:
Apply ordination techniques - PCA, PCoA, and NMDS each reveal different aspects of how environmental variables structure vegetation distributions
Determine dimensionality - Using scree plots, broken stick models, and stress values to select appropriate numbers of axes
Classify ecological data - K-means clustering can identify natural groupings, with performance varying by ordination method
Validate classifications - PERMANOVA and accuracy metrics help assess whether groups are statistically and ecologically meaningful
7.2 Take-Home Messages
No single best method: Different ordination techniques have different strengths. PCA is best for linear relationships, PCoA for any dissimilarity measure, and NMDS for non-linear patterns.
Dimensionality matters: Retaining too few axes loses information; retaining too many adds noise.
Classification is challenging: Even with clear environmental gradients, perfectly classifying vegetation types is difficult because ecosystems exist along continua rather than in discrete boxes.
Statistical validation is essential: Always test whether your groups are significantly different and evaluate classification accuracy.
7.3 Further Exploration
Consider these extensions:
Use the classified model to predict vegetation types at unsampled locations using WorldClimateData.csv
Apply hierarchical clustering methods and compare to K-means
Investigate which bioclimatic variables best discriminate deciduous broadleaf forests
Explore random forests or other machine learning approaches for vegetation classification